Abstract

We apply stochastic average gradient (SAG) algorithms for training conditional random fields (CRFs). 
We describe a practical implementation that uses structure in the CRF gradient to reduce the memory 
requirement of this linearly-convergent stochastic gradient method, propose a non-uniform sampling 
scheme that substantially improves practical performance, and analyze the rate of convergence of the 
SAGA variant under non-uniform sampling. Our experimental results reveal that our method often 
significantly outperforms existing methods in terms of the training objective, and performs as well or 
better than optimally-tuned stochastic gradient methods in terms of test error. 


1 Introduction 


Conditional random fields (CRFs) [Lafferty et al., 2001] are a ubiquitous tool in natural language processing.
They are used for part-of-speech tagging [McCallum et al., 2003], semantic role labeling [Cohn and Blunsom,
2005], topic modeling [Zhu and Xing, 2010], information extraction [Peng and McCallum, 2006], shallow
parsing [Sha and Pereira, 2003], named-entity recognition [Settles, 2004], as well as a host of other applications
in natural language processing and in other fields such as computer vision [Nowozin and Lampert, 2011].


Similar to generative Markov random field (MRF) models, CRFs allow us to model probabilistic dependencies 
between output variables. The key advantage of discriminative CRF models is the ability to use a very
high-dimensional feature set, without explicitly building a model for these features (as required by MRF models).
Despite the widespread use of CRFs, a major disadvantage of these models is that they can be very slow 
to train and the time needed for numerical optimization in CRF models remains a bottleneck in many 
applications. 

Due to the high cost of evaluating the CRF objective function on even a single training example, it is
now common to train CRFs using stochastic gradient methods [Vishwanathan et al., 2006]. These methods
are advantageous over deterministic methods because on each iteration they only require computing the
gradient of a single example (and not all examples as in deterministic methods). Thus, if we have a data set
with $n$ training examples, the iterations of stochastic gradient methods are $n$ times faster than those of
deterministic methods. However, the number of stochastic gradient iterations required might be very high.
This has been studied in the optimization community, which considers the problem of finding the minimum
number of iterations $t$ so that we can guarantee that we reach an accuracy of $\epsilon$, meaning that

$$f(w^t) - f(w^*) \leq \epsilon, \quad \text{and} \quad \|w^t - w^*\| \leq \epsilon,$$

where $f$ is our training objective function, $w^t$ is our parameter estimate on iteration $t$, and $w^*$ is the
parameter vector minimizing the training objective function. For strongly-convex objectives like
$\ell_2$-regularized CRFs, stochastic gradient methods require $O(1/\epsilon)$ iterations [Nemirovski et al., 2009].
This is in contrast to traditional deterministic methods which only require $O(\log(1/\epsilon))$ iterations
[Nesterov, 2004]. However, this much lower number of iterations comes at the cost of requiring us to process
the entire data set on each iteration.

For problems with a finite number of training examples, Le Roux et al. [2012] recently proposed the
stochastic average gradient (SAG) algorithm which combines the advantages of deterministic and stochastic
methods: it only requires evaluating a single randomly-chosen training example on each iteration, and only
requires $O(\log(1/\epsilon))$ iterations to reach an accuracy of $\epsilon$. Beyond this faster convergence rate, the SAG
method also allows us to address two issues that have traditionally frustrated users of stochastic gradient
methods: setting the step-size and deciding when to stop. Implementations of the SAG method use both an
adaptive step-size procedure and a cheaply-computable criterion for deciding when to stop. Le Roux et al.
[2012] show impressive empirical performance of the SAG algorithm for binary classification.


This is the first work to apply a SAG algorithm to train CRFs. We show that tracking marginals in 
the CRF can drastically reduce the SAG method’s huge memory requirement. We also give a non-uniform 
sampling (NUS) strategy that adaptively estimates how frequently we should sample each data point, and
we show that the SAG-like algorithm of Defazio et al. [2014] converges under any NUS strategy while a
particular NUS strategy achieves a faster rate. Our experiments compare the SAG algorithm with a variety
of competing deterministic, stochastic, and semi-stochastic methods on benchmark data sets for four common 
tasks: part-of-speech tagging, named entity recognition, shallow parsing, and optical character recognition. 
Our results indicate that the SAG algorithm with NUS often outperforms previous methods by an order of 
magnitude in terms of the training objective and, despite not requiring us to tune the step-size, performs as 
well or better than optimally tuned stochastic gradient methods in terms of the test error. 


2 Conditional Random Fields 


CRFs model the conditional probability of a structured output $y \in \mathcal{Y}$ (such as a sequence of labels) given
an input $x \in \mathcal{X}$ (such as a sequence of words) based on features $F(x, y)$ and parameters $w$ using

$$p(y \mid x, w) = \frac{\exp(w^T F(x, y))}{\sum_{y'} \exp(w^T F(x, y'))}. \tag{1}$$

Given $n$ pairs $\{x_i, y_i\}$ comprising our training set, the standard approach to training the CRF is to minimize
the $\ell_2$-regularized negative log-likelihood,

$$\min_w f(w) = -\frac{1}{n} \sum_{i=1}^{n} \log p(y_i \mid x_i, w) + \frac{\lambda}{2} \|w\|^2, \tag{2}$$

where $\lambda > 0$ is the strength of the regularization. Unfortunately, evaluating $\log p(y_i \mid x_i, w)$ is
expensive due to the summation over all possible configurations $y'$. For example, in chain-structured models
the forward-backward algorithm is used to compute $\log p(y_i \mid x_i, w)$ and its gradient. A second problem with
solving (2) is that the number of training examples $n$ in applications is constantly growing, and thus we
would like to use methods that only require a few passes through the data set.
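To make the cost of the normalizer in (1) concrete, the following is a minimal sketch of the forward recursion for a chain-structured model. It assumes the score $w^T F(x, y)$ has already been decomposed into a table of node scores `unary[t, k]` and a table of transition scores `pairwise[a, b]`; these array names and the helper functions are hypothetical and chosen only for illustration. A brute-force enumeration over all label sequences is included to check the recursion on a tiny example.

```python
import itertools
import numpy as np

def logsumexp(a, axis=None):
    """Numerically stable log(sum(exp(a))) along the given axis."""
    m = np.max(a, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)

def sequence_score(unary, pairwise, y):
    """Unnormalized log-score of one label sequence y (plays the role of w^T F(x, y))."""
    score = unary[0, y[0]]
    for t in range(1, len(y)):
        score += pairwise[y[t - 1], y[t]] + unary[t, y[t]]
    return score

def log_partition(unary, pairwise):
    """log of the denominator in (1), computed in O(T K^2) by the forward recursion."""
    alpha = unary[0].copy()
    for t in range(1, unary.shape[0]):
        # alpha[b] = unary[t, b] + log sum_a exp(alpha[a] + pairwise[a, b])
        alpha = unary[t] + logsumexp(alpha[:, None] + pairwise, axis=0)
    return logsumexp(alpha)

def log_likelihood(unary, pairwise, y):
    """log p(y | x, w) for a chain-structured model."""
    return sequence_score(unary, pairwise, y) - log_partition(unary, pairwise)

def log_partition_brute_force(unary, pairwise):
    """Same quantity by enumerating all K^T sequences (only feasible for tiny chains)."""
    T, K = unary.shape
    scores = [sequence_score(unary, pairwise, y)
              for y in itertools.product(range(K), repeat=T)]
    return logsumexp(np.array(scores))
```

The brute-force version touches $K^T$ sequences while the recursion touches $T K^2$ table entries, which is why even a single evaluation of the objective on one example is comparatively expensive but still tractable for chains.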


3 Related Work 


Lafferty et al. [2001] proposed an iterative scaling algorithm to solve problem (2), but this proved to be
inferior to generic deterministic optimization strategies like the limited-memory quasi-Newton algorithm
L-BFGS [Wallach, 2002, Sha and Pereira, 2003]. The bottleneck in these methods is that we must evaluate
$\log p(y_i \mid x_i, w)$ and its gradient for all $n$ training examples on every iteration. This is very expensive
for problems where $n$ is very large, so to deal with this problem stochastic gradient methods were examined
[Vishwanathan et al., 2006, Finkel et al., 2008]. However, traditional stochastic gradient methods require
$O(1/\epsilon)$ iterations rather than the much smaller $O(\log(1/\epsilon))$ required by deterministic methods.

There have been several attempts at improving the cost of deterministic methods or the convergence rate
of stochastic methods. For example, the exponentiated gradient method of Collins et al. [2008] processes the
data online and only requires $O(\log(1/\epsilon))$ iterations to reach an accuracy of $\epsilon$ in terms of the dual objective.
However, this does not guarantee good performance in terms of the primal objective or the weight vector.
Although this method is highly effective if $\lambda$ is very large, our experiments and the experiments of others
show that the performance of online exponentiated gradient can degrade substantially if a small value of
$\lambda$ is used (which may be required to achieve the best test error); see Collins et al. [2008, Figures 5-6 and
Table 3] and Lacoste-Julien et al. [2013, Figure 1]. In contrast, SAG degrades more gracefully as $\lambda$ becomes
small, even achieving a convergence rate faster than classic SG methods when $\lambda = 0$ [Schmidt et al., 2013].


Lavergne et al. [2010] consider using multiple processors and vectorized computation to reduce the high
iteration cost of quasi-Newton methods, but when $n$ is enormous these methods still have a high iteration
cost. Friedlander and Schmidt [2012] explore a hybrid deterministic-stochastic method that slowly grows the
number of examples that are considered in order to achieve an $O(\log(1/\epsilon))$ convergence rate with a decreased
cost compared to deterministic methods.

Below we state the convergence rates of different methods for training CRFs, including the fastest known
rates for deterministic algorithms (like L-BFGS and accelerated gradient) [Nesterov, 2004], stochastic
algorithms (like [averaged] stochastic gradient and AdaGrad) [Ghadimi and Lan, 2012], online exponentiated
gradient, and SAG. Here $L$ is the Lipschitz constant of the gradient of the objective, $\mu$ is the strong-convexity
constant (and we have $\lambda \leq \mu \leq L$), and $\sigma^2$ bounds the variance of the gradients.


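Deterministic:   $O(n \sqrt{L/\mu} \log(1/\epsilon))$                  (primal)
Online EG:       $O((n + L/\lambda) \log(1/\epsilon))$                 (dual)
Stochastic:      $O(\sigma^2/(\mu\epsilon) + \sqrt{L/\epsilon})$       (primal)
SAG:             $O((n + L/\mu) \log(1/\epsilon))$                     (primal)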


4 Stochastic Average Gradient 


Le Roux et al. [2012] introduce the SAG algorithm, a simple method with the low iteration cost of stochastic
gradient methods but that only requires $O(\log(1/\epsilon))$ iterations. To motivate this new algorithm, we write
the classic gradient descent iteration as

$$w^{t+1} = w^t - \frac{\alpha}{n} \sum_{i=1}^{n} s_i^t, \tag{3}$$


where $\alpha$ is the step-size and at each iteration we set the ‘slope’ variables $s_i^t$ to the gradient with respect to
training example $i$ at $w^t$, so that $s_i^t = -\nabla \log p(y_i \mid x_i, w^t) + \lambda w^t$. The SAG algorithm uses this same iteration,
but instead of updating $s_i^t$ for all $n$ data points on every iteration, it simply sets $s_i^t = -\nabla \log p(y_i \mid x_i, w^t) + \lambda w^t$
for one randomly chosen data point and keeps the remaining $s_i^t$ at their value from the previous iteration.
Thus the SAG algorithm is a randomized version of the gradient algorithm where we use the gradient of
each example from the last iteration where it was selected. The surprising aspect of the work of Le Roux
et al. [2012] is that this simple delayed gradient algorithm achieves a similar convergence rate to the classic
full gradient algorithm despite the iterations being $n$ times faster.


4.1 Implementation for CRFs 

Unfortunately, a major problem with applying (3) to CRFs is the requirement to store the $s_i^t$. While the
CRF gradients $\nabla \log p(y_i \mid x_i, w^t)$ have a nice structure (see Section 4.2), $s_i^t$ includes $\lambda w^t$ for some previous $t$,
which is dense and unstructured. To get around this issue, instead of using (3) we use the following SAG-like
update [Le Roux et al., 2012, Section 4]

$$w^{t+1} = w^t - \alpha \left( \frac{1}{m} d + \lambda w^t \right) = (1 - \alpha\lambda) w^t - \frac{\alpha}{m} d, \tag{4}$$

where $g_i^t$ is the value of $-\nabla \log p(y_i \mid x_i, w^k)$ for the last iteration $k$ where $i$ was selected and $d$ is the sum of
the $g_i^t$ over all $i$. Thus, this update uses the exact gradient of the regularizer and only uses an approximation
for the (structured) CRF log-likelihood gradients. Since we don’t yet have any information about these
log-likelihoods at the start, we initialize the algorithm by setting $g_i^0 = 0$. But to compensate for this, we
track the number of examples seen $m$, and normalize $d$ by $m$ in the update (instead of $n$). In Algorithm 1,
we summarize this variant of the SAG algorithm for training CRFs.^1


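Algorithm 1 SAG algorithm for training CRFs
Require: $\{x_i, y_i\}$, $\lambda$, $w$, $\delta$
 1: $m \leftarrow 0$, $g_i \leftarrow 0$ for $i = 1, 2, \ldots, n$
 2: $d \leftarrow 0$, $L_g \leftarrow 1$
 3: while $m < n$ or $\|\frac{1}{n} d + \lambda w\|_\infty > \delta$ do
 4:     Sample $i$ from $\{1, 2, \ldots, n\}$
 5:     $f \leftarrow -\log p(y_i \mid x_i, w)$
 6:     $g \leftarrow -\nabla \log p(y_i \mid x_i, w)$
 7:     if this is the first time we sampled $i$ then
 8:         $m \leftarrow m + 1$
 9:     end if
        Subtract old gradient $g_i$, add new gradient $g$:
10:     $d \leftarrow d - g_i + g$
        Replace old gradient of example $i$:
11:     $g_i \leftarrow g$
12:     if $\|g_i\|^2 > 10^{-8}$ then
13:         $L_g \leftarrow \text{lineSearch}(x_i, y_i, f, g_i, w, L_g)$
14:     end if
15:     $\alpha \leftarrow 1/(L_g + \lambda)$
16:     $w \leftarrow (1 - \alpha\lambda) w - \frac{\alpha}{m} d$
17:     $L_g \leftarrow L_g \cdot 2^{-1/n}$
18: end while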
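As a rough illustration of the bookkeeping in Algorithm 1, the sketch below runs the SAG-like update (4) with a fixed $L_g$ (no line-search and no stopping test). It is not a CRF implementation: a squared-error loss $\frac{1}{2}(x_i^T w - y_i)^2$ stands in for $-\log p(y_i \mid x_i, w)$ so that the example is self-contained, and the function name `sag_fixed_step` and its arguments are assumptions made for this illustration.

```python
import numpy as np

def sag_fixed_step(X, y, lam, L_g, n_passes=300, seed=0):
    """SAG-like update (4) with a squared-error loss standing in for -log p(y_i | x_i, w)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    g = np.zeros((n, p))            # g_i: last stored loss gradient of example i
    seen = np.zeros(n, dtype=bool)  # whether example i has been sampled before
    d = np.zeros(p)                 # d: sum of the stored g_i
    m = 0                           # number of distinct examples seen so far
    alpha = 1.0 / (L_g + lam)       # step-size as on line 15 of Algorithm 1
    for _ in range(n_passes * n):
        i = rng.integers(n)
        g_new = X[i] * (X[i] @ w - y[i])   # gradient of 0.5 * (x_i^T w - y_i)^2
        if not seen[i]:
            seen[i] = True
            m += 1
        d += g_new - g[i]                  # subtract old gradient, add new gradient
        g[i] = g_new                       # replace old gradient of example i
        w = (1.0 - alpha * lam) * w - (alpha / m) * d
    return w, d, m
```

With this stand-in loss, the minimizer of the regularized objective has a closed form, so the iterate returned by the sketch can be compared against `np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)`.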


In many applications of CRFs the $g_i^t$ are very sparse, and we would like to take advantage of this as in
stochastic gradient methods. Fortunately, we can implement (4) without using dense vector operations by
using the representation $w^t = \beta^t v^t$ for a scalar $\beta^t$ and a vector $v^t$, and using ‘lazy updates’ that apply $d$
repeatedly to an individual variable when it is needed [Le Roux et al., 2012].
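The scaled representation can be sketched as follows. This is a simplified illustration, assuming the vector being subtracted has only a few non-zero entries given as unique indices `idx` with values `vals`; the class name `ScaledVector` is hypothetical, and the lazy-update bookkeeping that defers the contribution of a dense $d$ is deliberately left out.

```python
import numpy as np

class ScaledVector:
    """Stores w implicitly as beta * v so that shrinking w by a scalar costs O(1)."""

    def __init__(self, w):
        self.beta = 1.0
        self.v = np.array(w, dtype=float)

    def step(self, idx, vals, alpha, lam, m):
        """Apply w <- (1 - alpha*lam) * w - (alpha/m) * d, where d is sparse with
        non-zeros `vals` at the (unique) positions `idx`."""
        self.beta *= 1.0 - alpha * lam
        # Dividing by the new beta makes beta * v equal the dense update exactly.
        self.v[idx] -= (alpha / (m * self.beta)) * vals

    def dense(self):
        """Materialize w = beta * v (only needed when the full vector is required)."""
        return self.beta * self.v
```

The multiplication by $(1 - \alpha\lambda)$ touches only the scalar, so the cost of a step is proportional to the number of non-zeros rather than to the dimension of $w$. Because `beta` shrinks geometrically, a practical implementation would occasionally fold it back into `v` to avoid underflow.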


Also following Le Roux et al. [2012], we set the step-size to $\alpha = 1/L$, where $L$ is an approximation to the
maximum Lipschitz constant of the gradients. This is the smallest number $L$ such that

$$\|\nabla f_i(w) - \nabla f_i(v)\| \leq L \|w - v\|, \tag{5}$$

for all $i$, $w$, and $v$. This quantity is a bound on how fast the gradient can change as we change the
weight vector. The Lipschitz constant with respect to the gradient of the regularizer is simply $\lambda$. This
gives $L \leq L_g + \lambda$, where $L_g$ is the Lipschitz constant of the gradient of the log-likelihood. Unfortunately, $L_g$
depends on the covariance of the CRF and is typically too expensive to compute. To avoid this computation,
as in Le Roux et al. [2012] we approximate $L_g$ using the line-search given by Algorithm 2 [Beck and Teboulle,
2009]. This line-search is attractive since it uses function values (which only require the forward algorithm
for CRFs) rather than gradient values (which require the forward and backward steps). Algorithm 2
monotonically increases $L_g$, but we also slowly decrease it in Algorithm 1 in order to allow the possibility
that we can use a more aggressive step-size as we approach the solution.

^1 If we solve the problem for a sequence of regularization parameters, we can obtain better performance by
warm-starting $g_i$, $d$, and $m$.


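Algorithm 2 Lipschitz line-search algorithm
Require: $x_i$, $y_i$, $f$, $g_i$, $w$, $L_g$.
1: $f' = -\log p(y_i \mid x_i, w - \frac{1}{L_g} g_i)$
2: while $f' > f - \frac{1}{2 L_g} \|g_i\|^2$ do
3:     $L_g = 2 L_g$
4:     $f' = -\log p(y_i \mid x_i, w - \frac{1}{L_g} g_i)$
5: end while
6: return $L_g$.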
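The doubling rule of Algorithm 2 can be sketched generically as below. The sketch assumes the per-example loss is available as a Python callable `f_i` (a hypothetical name standing in for $w \mapsto -\log p(y_i \mid x_i, w)$), and it is exercised on a simple quadratic whose gradient-Lipschitz constant is known.

```python
import numpy as np

def lipschitz_line_search(f_i, w, f_val, g_i, L_g):
    """Double L_g until the step w - g_i / L_g decreases f_i sufficiently (Algorithm 2)."""
    g_sq = float(g_i @ g_i)
    f_new = f_i(w - g_i / L_g)
    while f_new > f_val - g_sq / (2.0 * L_g):
        L_g *= 2.0
        f_new = f_i(w - g_i / L_g)
    return L_g

# Toy check on f(w) = 0.5 * c * ||w||^2, whose gradient c * w has Lipschitz constant c.
c = 5.0
quadratic = lambda w: 0.5 * c * float(w @ w)
w0 = np.array([1.0, 2.0])
L_est = lipschitz_line_search(quadratic, w0, quadratic(w0), c * w0, L_g=1.0)
print(L_est)  # prints 8.0, the first power of two above c
```

Each pass through the loop costs one extra function evaluation and no gradient evaluation, which is the property the text relies on.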

Since the solution is the only stationary point, we must have $\nabla f(w^*) = 0$ at the solution. Further,
the value $\frac{1}{n} d + \lambda w^t$ converges to $\nabla f(w^*)$ so we can use the size of this value to decide when to stop the
algorithm (although we also require that m = n to avoid premature stopping before we have seen the full 
data set). This is in contrast to classic stochastic gradient methods, where the step-size must go to zero and 
it is therefore difficult to decide if the algorithm is close to the optimal value or if we simply require a small 
step-size to continue making progress. 


4.2 Reducing the Memory Requirements 

Even if the gradients $g_i^t$ are not sparse, we can often reduce the memory requirements of Algorithm 1 because
it is known that the CRF gradients only depend on $w$ through marginals of the features. Specifically, the
gradient of the log-likelihood under model (1) with respect to feature $j$ is given by

$$\begin{aligned}
\nabla_j \log p(y \mid x, w) &= F_j(x, y) - \frac{\sum_{y'} \exp(w^T F(x, y')) F_j(x, y')}{\sum_{y'} \exp(w^T F(x, y'))} \\
&= F_j(x, y) - \sum_{y'} p(y' \mid x, w) F_j(x, y') \\
&= F_j(x, y) - \mathbb{E}_{y' \mid x, w}[F_j(x, y')].
\end{aligned}$$

Typically, each feature $j$ only depends on a small ‘part’ of $y$. For example, we typically include features of
the form $F_j(x, y) = F(x) \mathbb{I}[y_k = s]$ for some function $F$, where $k$ is an element of $y$ and $s$ is a discrete state
that $y_k$ can take. In this case, the gradient can be written in terms of the marginal probability of element
$y_k$ taking state $s$,

$$\begin{aligned}
\nabla_j \log p(y \mid x, w) &= F(x) \mathbb{I}[y_k = s] - \mathbb{E}_{y' \mid x, w}[F(x) \mathbb{I}[y'_k = s]] \\
&= F(x) \left( \mathbb{I}[y_k = s] - \mathbb{E}_{y' \mid x, w}[\mathbb{I}[y'_k = s]] \right) \\
&= F(x) \left( \mathbb{I}[y_k = s] - p(y_k = s \mid x, w) \right).
\end{aligned}$$

Notice that Algorithm 1 only depends on the old gradient through its difference with the new gradient (line
10), which in this example gives

$$\nabla_j \log p(y \mid x, w) - \nabla_j \log p(y \mid x, w_{\text{old}}) = F(x) \left( p(y_k = s \mid x, w_{\text{old}}) - p(y_k = s \mid x, w) \right),$$

where $w$ is the current parameter vector and $w_{\text{old}}$ is the old parameter vector. Thus, to perform this
calculation the only thing we need to know about $w_{\text{old}}$ is the unary marginal $p(y_k = s \mid x, w_{\text{old}})$, which will
be shared across features that only depend on the event that $y_k = s$. Similarly, features that depend on
pairs of values in $y$ will need to store pairwise marginals, $p(y_k = s, y_{k'} = s' \mid x, w_{\text{old}})$. For general pairwise
graphical model structures, the memory requirements to store these marginals will thus be $O(VK + EK^2)$,
where $V$ is the number of vertices, $E$ is the number of edges, and $K$ is the number of states each $y_k$ can take.
This can be an enormous reduction since it does not depend on the number of features. Further, since
computing these marginals is a by-product of computing the gradient, this potentially-enormous reduction
in the memory requirements comes at no extra computational cost.
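The identity above can be checked numerically on the smallest possible case. The sketch below assumes a single-node model with $K$ states, whose scores are `W @ fx` for a hypothetical weight matrix `W` (one row per state) and input feature vector `fx` playing the role of $F(x)$. It compares the gradient difference computed from two full gradients with the one rebuilt from the $K$ stored marginals alone.

```python
import numpy as np

def marginals(W, fx):
    """p(y = s | x, W) for a single-node model with scores W @ F(x)."""
    z = W @ fx
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def grad_log_lik(W, fx, y):
    """Full gradient of log p(y | x, W): row s is F(x) * (I[y = s] - p(y = s | x, W))."""
    indicator = np.zeros(W.shape[0])
    indicator[y] = 1.0
    return np.outer(indicator - marginals(W, fx), fx)

def grad_difference_from_marginals(p_old, p_new, fx):
    """grad(w) - grad(w_old), rebuilt from the K stored marginals only."""
    return np.outer(p_old - p_new, fx)
```

In this toy setting the stored state per example shrinks from a $K \times p$ gradient to a length-$K$ vector of marginals, where $p$ is the number of input features; the indicator terms cancel in the difference, so the label is not needed either.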


5 Non-Uniform Sampling 


Recently, several works show that we can improve the convergence rates of randomized optimization
algorithms by using non-uniform sampling (NUS) schemes. This includes randomized Kaczmarz [Strohmer and
Vershynin, 2009], randomized coordinate descent [Nesterov, 2012], and stochastic gradient methods [Needell
et al., 2014]. The key idea behind all of these NUS strategies is to bias the sampling towards the Lipschitz
constants of the gradients, so that gradients that change quickly get sampled more often and gradients that
change slowly get sampled less often. Specifically, we maintain a Lipschitz constant $L_i$ for each training
example $i$ and, instead of the usual sampling strategy $p_i = 1/n$, we bias towards the distribution $p_i = L_i / \sum_j L_j$.
In these various contexts, NUS allows us to improve the dependence on the values $L_i$ in the convergence
rate, since the NUS methods depend on $\bar{L} = (1/n) \sum_j L_j$, which may be substantially smaller than the usual
dependence on $L = \max_j \{L_j\}$. Schmidt et al. [2013] argue that faster convergence rates might be achieved
with NUS for SAG since it allows a larger step size $\alpha$ that depends on $\bar{L}$ instead of $L$.

The scheme for SAG proposed by Schmidt et al. [2013, Section 5.5] uses a fairly complicated adaptive
NUS scheme and step-size, but the key ingredient is estimating each constant $L_i$ using Algorithm 2. Our
experiments show this method often already improves on state of the art methods for training CRFs by
a substantial margin, but we found we could obtain improved performance for training CRFs using the
following simple NUS scheme for SAG: as in Needell et al. [2014], with probability 0.5 choose $i$ uniformly
and with probability 0.5 sample $i$ with probability $L_i / \sum_j L_j$ (restricted to the examples we have previously
seen).^2 We also use a step-size of $\alpha = \frac{1}{2}(1/L + 1/\bar{L})$, since the faster convergence rate with NUS is due
to the ability to use a larger step-size than $1/L$. This simple step-size and sampling scheme contrasts with
the more complicated choices described by Schmidt et al. [2013, Section 5.5], that make the degree of
non-uniformity grow with the number of examples seen $m$. This prior work initializes each $L_i$ to 1, and updates
$L_i$ to $0.5 L_i$ each subsequent time an example is chosen. In the context of CRFs, this leads to a large number
of expensive backtracking iterations. To avoid this, we initialize $L_i$ with $0.5 \bar{L}$ the first time an example is
chosen, and decrease $L_i$ to $0.9 L_i$ each time it is subsequently chosen. Allowing the $L_i$ to decrease seems
crucial to obtaining the best practical performance of the method, as it allows the algorithm to take bigger
step sizes if the values of $L_i$ are small near the solution.
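The simple sampling rule and step-size described above can be sketched as follows. The function names are hypothetical, and two details are assumptions made for this illustration rather than statements about the authors' implementation: the maximum and mean of the $L_i$ are taken over the examples seen so far, and sampling falls back to purely uniform when nothing has been seen yet.

```python
import numpy as np

def nus_probabilities(L, seen):
    """Mixture used for sampling: half uniform over all n examples, half
    proportional to L_i over the examples seen so far."""
    n = len(L)
    p = np.full(n, 0.5 / n)
    if seen.any():
        L_seen = np.where(seen, L, 0.0)
        p += 0.5 * L_seen / L_seen.sum()
    else:
        p *= 2.0   # nothing seen yet: fall back to purely uniform sampling
    return p

def nus_step_size(L, seen):
    """alpha = 0.5 * (1 / L_max + 1 / L_mean), computed over the seen examples."""
    L_seen = L[seen]
    return 0.5 * (1.0 / L_seen.max() + 1.0 / L_seen.mean())
```

An index can then be drawn with `np.random.default_rng().choice(len(L), p=nus_probabilities(L, seen))`. Because of the uniform half of the mixture, every example keeps probability at least $0.5/n$ of being sampled, including examples not yet seen.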


5.1 Convergence Analysis under NUS


Schmidt et al. [2013] give an intuitive but non-rigorous motivation for using NUS in SAG. More recently, Xiao
and Zhang [2014] show that NUS gives a dependence on $\bar{L}$ in the context of a related algorithm that uses
occasional full passes through the data (which substantially simplifies the analysis). Below, we analyze a
NUS extension of the SAGA algorithm of Defazio et al. [2014], which does not require full passes through
the data and has similar performance to SAG in practice but is much easier to analyze.


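^2 An interesting difference between the SAG update with NUS and NUS for stochastic gradient methods is that the SAG
update does not seem to need to decrease the step-size for frequently-sampled examples (since the SAG update does not rely
on using an unbiased gradient estimate).

^3 Needell et al. [2014] analyze the basic stochastic gradient method and thus require $O(1/\epsilon)$ iterations.

Proposition 1. Let the sequences $\{w^t\}$ and $\{s_j^t\}$ be defined by

$$w^{t+1} = w^t - \alpha \left[ \frac{1}{n p_{j_t}} \left( \nabla f_{j_t}(w^t) - s_{j_t}^t \right) + \frac{1}{n} \sum_{i=1}^{n} s_i^t \right],$$

$$s_j^{t+1} = \begin{cases} \nabla f_{r_t}(w^t) & \text{if } j = r_t, \\ s_j^t & \text{otherwise,} \end{cases}$$

where $j_t$ is chosen with probability $p_j$.

(a) If $r_t$ is set to $j_t$, then with $\alpha = \frac{n p_{\min}}{4L + n\mu}$ we have

$$\mathbb{E}[\|w^t - w^*\|^2] \leq (1 - \mu\alpha)^t \left[ \|w^0 - w^*\|^2 + C_a \right],$$

where $p_{\min} = \min_i \{p_i\}$ and

$$C_a = \frac{2 p_{\min}}{(4L + n\mu)^2} \sum_{i=1}^{n} \frac{1}{p_i} \|\nabla f_i(w^0) - \nabla f_i(w^*)\|^2.$$

(b) If $p_j = L_j / \sum_{i=1}^{n} L_i$ and $r_t$ is chosen uniformly at random, then with $\alpha = \frac{1}{4\bar{L}}$ we have

$$\mathbb{E}[\|w^t - w^*\|^2] \leq \left( 1 - \min\left\{ \frac{1}{3n}, \frac{\mu}{8\bar{L}} \right\} \right)^t \left[ \|w^0 - w^*\|^2 + C_b \right],$$

where

$$C_b = \frac{n}{2\bar{L}} \left[ f(w^0) - f(w^*) \right].$$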

This result (which we prove in Appendix A and B) shows that SAGA has (a) a linear convergence rate
for any NUS scheme where $p_i > 0$ for all $i$, and (b) a rate depending on $\bar{L}$ by sampling proportional to
the Lipschitz constants and also generating a uniform sample. However, (a) achieves the fastest rate when
$p_i = 1/n$ while (b) requires two samples on each iteration. We were not able to show a faster rate using only
one sample on each iteration as used in our implementation.
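A minimal sketch of the single-sample variant in case (a), where the refreshed stored gradient is the one just sampled, is given below. The function name `saga_nus` and the callable `grad_i(j, w)` (returning $\nabla f_j(w)$, regularizer included) are assumptions for this illustration, and the step-size is left as a user-supplied argument rather than tied to any particular bound.

```python
import numpy as np

def saga_nus(grad_i, n, dim, p, alpha, n_iters, seed=0):
    """SAGA with non-uniform sampling, case (a): the stored gradient that gets
    refreshed is the one that was just sampled (r_t = j_t)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    s = np.zeros((n, dim))     # s_j: stored gradient of f_j
    s_mean = np.zeros(dim)     # (1/n) * sum_j s_j, maintained incrementally
    for _ in range(n_iters):
        j = rng.choice(n, p=p)
        g = grad_i(j, w)
        # The importance weight 1 / (n p_j) keeps the gradient estimate unbiased.
        w = w - alpha * ((g - s[j]) / (n * p[j]) + s_mean)
        s_mean += (g - s[j]) / n
        s[j] = g
    return w
```

Taking the expectation of the bracketed term over $j$ gives the full gradient at $w$, which is what distinguishes this update from the SAG update and makes it easier to analyze. On a small ridge-regression problem, a conservative step-size together with the mixture $p_j = 0.5/n + 0.5 L_j / \sum_i L_i$ is enough for the iterates to approach the closed-form minimizer.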


5.2 Line-Search Skipping 

To reduce the number of function evaluations required by the NUS strategy, we also explored a line-search
skipping strategy. The general idea is to consider skipping the line-search for example $i$ if the line-search
criterion was previously satisfied for example $i$ without backtracking. Specifically, if the line-search criterion
was satisfied $\xi$ consecutive times for example $i$ (without backtracking), then we do not do the line-search
on the next $2^{\xi - 1}$ times example $i$ is selected (we also do not multiply $L_i$ by 0.9 on these iterations). This
drastically reduces the number of function evaluations required in the later iterations.
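One way to read the skipping rule is sketched below. The class and method names are hypothetical, and the sketch assumes that the streak of consecutive successes keeps growing across skipped selections and is reset to zero as soon as a line-search has to backtrack; the text does not spell out these details.

```python
class LineSearchSkipper:
    """Per-example bookkeeping for skipping line-searches after repeated successes."""

    def __init__(self, n):
        self.streak = [0] * n      # consecutive line-searches passed without backtracking
        self.skip_left = [0] * n   # upcoming selections of example i that skip the search

    def should_search(self, i):
        """Call when example i is selected; returns False if the search is skipped."""
        if self.skip_left[i] > 0:
            self.skip_left[i] -= 1
            return False
        return True

    def record(self, i, backtracked):
        """Call after a line-search on example i was actually performed."""
        if backtracked:
            self.streak[i] = 0
            self.skip_left[i] = 0
        else:
            self.streak[i] += 1
            self.skip_left[i] = 2 ** (self.streak[i] - 1)
```

Under this reading, an example whose line-search always succeeds is searched on its 1st, 3rd, 6th, and 11th selections, so the gaps between searches double each time.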


6 Experiments 


We compared a wide variety of approaches on four CRF training tasks: the optical character recognition
(OCR) dataset of Taskar et al. [2003], the CoNLL-2000 shallow parse chunking dataset,^4 the CoNLL-2002
Dutch named-entity recognition dataset,^5 and a part-of-speech (POS) tagging task using the Penn Treebank
Wall Street Journal data (POS-WSJ). The optical character recognition dataset labels the letters in images
of words. Chunking segments a sentence into syntactic chunks by tagging each sentence token with a chunk
tag corresponding to its constituent type (e.g., ‘NP’, ‘VP’, etc.) and location (e.g., beginning, inside, ending,
or outside any constituent). We use standard n-gram and POS tag features [Sha and Pereira, 2003]. For the
named-entity recognition task, the goal is to identify named entities and correctly classify them as persons,
organizations, locations, times, or quantities. We again use standard n-gram and POS tag features, as well
as word shape features over the case of the characters in the token. The POS-tagging task assigns one of
45 syntactic tags to each token in each of the sentences in the data. For this data, we follow the standard
division of the WSJ data given by Collins [2002], using sections 0-18 for training, 19-21 for development,
and 22-24 for testing. We use the standard set of features following Ratnaparkhi [1996] and Collins [2002]:
n-gram, suffix, and shape features. As is common on these tasks, our pairwise features do not depend on $x$.

^4 http://www.cnts.ua.ac.be/conll2000/chunking
^5 http://www.cnts.ua.ac.be/conll2002/ner

On these datasets we compared the performance of a set of competitive methods, including five variants on
classic stochastic gradient methods: Pegasos, which is a standard stochastic gradient method with a step-size
of $\alpha = \eta/(\lambda t)$ on iteration $t$ [Shalev-Shwartz et al., 2011];^6 a basic stochastic gradient (SG) method where we
use a constant $\alpha = \eta$; an averaged stochastic gradient (ASG) method where we use a constant step-size $\alpha = \eta$
and average the iterations;^7 AdaGrad, where we use the per-variable
$\alpha_j = \eta / (\delta + \sqrt{\sum_{i=1}^{t} \nabla_j \log p(y_i \mid x_i, w^i)^2})$
and the proximal-step with respect to the $\ell_2$-regularizer [Duchi et al., 2011]; and stochastic meta-descent
(SMD), where we initialize with $\alpha_j = \eta$ and dynamically update the step-size [Vishwanathan et al., 2006].
Since setting the step-size is a notoriously hard problem when applying stochastic gradient methods, we let
these classic stochastic gradient methods cheat by choosing the $\eta$ which gives the best performance among
powers of 10 on the training data (for SMD we additionally tested the four choices among the paper and
associated code of Vishwanathan et al. [2006], and we found $\delta = 1$ worked well for AdaGrad).^8 Our
comparisons also included a deterministic L-BFGS algorithm [Schmidt, 2005] and the Hybrid L-BFGS/stochastic
algorithm of Friedlander and Schmidt [2012]. We also included the online exponentiated gradient (OEG)
method [Collins et al., 2008], and we followed the heuristics in the author’s code.^9 Finally, we included the
SAG algorithm as described in Section 4, the SAG-NUS variant of Schmidt et al. [2013], and our proposed
SAG-NUS* strategy from Section 5.^10 We also tested SAGA variants of each of the SAG algorithms, and
found that they gave very similar performance. All methods (except OEG) were initialized at zero.

Figure 1 shows the result of our experiments on the training objective and Figure 2 shows the result of
tracking the test error. Here we measure the number of ‘effective passes’, meaning $(1/n)$ times the number
of times we performed the bottleneck operation of computing $\log p(y_i \mid x_i, w)$ and its gradient. This is an
implementation-independent way to compare the convergence of the different algorithms (most of whose
runtimes differ only by a small constant), but we have included the performance in terms of runtime in
Appendix E. For the different SAG methods that use a line-search we count the extra ‘forward’ operations
used by the line-search as full evaluations of $\log p(y_i \mid x_i, w)$ and its gradient, even though these operations are
cheaper because they do not require the backward pass nor computing the gradient. In these experiments
we used $\lambda = 1/n$, which yields a value close to the optimal test error across all data sets. The objective is
strongly-convex and thus has a unique minimum value. We approximated this value by running L-BFGS for
up to 1000 iterations, which always gave a value of $w$ satisfying $\|\nabla f(w)\|_\infty \leq 1.4 \times 10^{-7}$, indicating that
this is a very accurate approximation of the true solution. In the test error plots, we have excluded the SAG
and SAG-NUS methods to keep the plots interpretable (while Pegasos does not appear because it performs
very poorly), but Appendix C includes these plots with all methods added. In the test error plots, we have
also plotted as dotted lines the performance of the classic stochastic gradient methods when the second-best
step-size is used.

^6 We also tested Pegasos with averaging but it always performed worse than the non-averaged version.

^7 We also tested SG and ASG with decreasing step-sizes of either $\alpha_t = \eta/\sqrt{t}$ or $\alpha_t = \eta/(\delta + t)$, but these gave worse
performance than using a constant step size.

^8 Because of the extra implementation effort required to implement it efficiently, we did not test SMD on the POS dataset,
but we do not expect it to be among the best performers on this data set.

^9 Specifically, for OEG we proceed through a random permutation of the dataset on the first pass through the data, we
perform a maximum of 2 backtracking iterations per example on this first pass (and 5 on subsequent passes), we initialize the
per-sample step-sizes to 0.5 and divide them by 2 if the dual objective does not increase (and multiply them by 1.05 after
processing the example), and to initialize the dual variables we set parts with the correct label from the training set to 3 and
parts with the incorrect label to 0.

^10 We also tested SG with the proposed NUS scheme, but the performance was similar to the regular SG method. This
is consistent with the analysis of Needell et al. [2014, Corollary 3.1] showing that NUS for regular SG only improves the
non-dominant term.

Figure 1: Objective minus optimal objective value against effective number of passes for different
deterministic, stochastic, and semi-stochastic optimization strategies. Top-left: OCR, Top-right: CoNLL-2000,
bottom-left: CoNLL-2002, bottom-right: POS-WSJ.

We make several observations based on these experiments: 


• SG outperformed Pegasos. Pegasos is known to move exponentially away from the solution in the early
  iterations [Bach and Moulines, 2011], meaning that $\|w^t - w^*\| \geq \rho^t \|w^0 - w^*\|$ for some $\rho > 1$, while SG
  moves exponentially towards the solution ($\rho < 1$) in the early iterations [Nedic and Bertsekas, 2000].

• ASG outperformed AdaGrad and SMD (in addition to SG). ASG methods are known to achieve the
  same asymptotic efficiency as an optimal stochastic Newton method [Polyak and Juditsky, 1992], while
  AdaGrad and SMD can be viewed as approximations to a stochastic Newton method. Vishwanathan
  et al. [2006] did not compare to ASG, because applying ASG to large/sparse data requires the recursion
  of Xu [2010].

• Hybrid outperformed L-BFGS. The hybrid algorithm processes fewer data points in the early iterations,
  leading to cheaper iterations.

• None of the three algorithms ASG/Hybrid/SAG dominated the others: the relative ranks of these
  methods changed based on the data set and whether we could choose the optimal step-size.

• OEG performed very well on the first two datasets, but was less effective on the second two. By
  experimenting with various initializations, we found that we could obtain much better performance
  with OEG on these two datasets. We report these results in Appendix D, although Appendix E
  shows that OEG was less competitive in terms of runtime.

• Both SAG-NUS methods outperform all other methods (except OEG) by a substantial margin based
  on the training objective, and are always among the best methods in terms of the test error. Further,
  our proposed SAG-NUS* always outperforms SAG-NUS.

Figure 2: Test error against effective number of passes for different deterministic, stochastic, and
semi-stochastic optimization strategies (this figure is best viewed in colour). Top-left: OCR, Top-right:
CoNLL-2000, bottom-left: CoNLL-2002, bottom-right: POS-WSJ. The dotted lines show the performance of the
classic stochastic gradient methods when the optimal step-size is not used. Note that the performance of all
classic stochastic gradient methods is much worse when the optimal step-size is not used, whereas the SAG
methods have an adaptive step-size so are not sensitive to this choice.


On three of the four data sets, the best classic stochastic gradient methods (AdaGrad and ASG) seem
to reach the optimal test error with a similar speed to the SAG-NUS* method, although they require many
passes to reach the optimal test error on the OCR data. Further, we see that the good test error performance
of the AdaGrad and ASG methods is very sensitive to choosing the optimal step-size, as the methods perform
much worse if we don’t use the optimal step-size (dashed lines in Figure 2). In contrast, SAG uses an adaptive
step-size and has virtually identical performance even if the initial value of Lg is too small by several orders 
of magnitude (the line-search quickly increases Lg to a reasonable value on the first training example, so the 
dashed black line in Figure 2 would be on top of the solid line). 

To quantify the memory savings given by the choices in Section 4, below we report the size of the memory
required for these datasets under different memory-saving strategies divided by the memory required by the 
naive SAG algorithm. Sparse refers to only storing non-zero gradient values. Marginals refers to storing all 
unary and pairwise marginals, and Mixed refers to storing node marginals and the gradient with respect to 
pairwise features (recall that the pairwise features do not depend on x in our models). 


Dataset        Sparse         Marginals      Mixed
OCR            7.8 × 10^-1    1.1 × 10^0     2.1 × 10^-1
CoNLL-2000     4.8 × 10^-3    7.0 × 10^-3    6.1 × 10^-4
CoNLL-2002     6.4 × 10^-?    3.8 × 10^-4    7.0 × 10^-3
POS-WSJ        1.3 × 10^-3    5.5 × 10^-3    3.6 × 10^-4
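
To make the storage strategies concrete, the following sketch counts stored numbers for chain-structured examples. The accounting is an assumption made for illustration: naive SAG is taken to store one dense gradient per example, Marginals to store all node and edge marginals, and Mixed to store node marginals plus one gradient over the pairwise features. The function name and the example sizes are hypothetical, and the ratios it returns are not the ones reported in the table.

```python
def memory_ratios(lengths, num_labels, num_features):
    """Return (marginals / naive, mixed / naive) storage ratios.

    lengths      : sequence length T of each training example
    num_labels   : number of labels K per position
    num_features : dimension of the dense gradient stored by naive SAG
    """
    K = num_labels
    naive = len(lengths) * num_features
    # Node marginals (T * K) plus edge marginals ((T - 1) * K^2).
    marginals = sum(T * K + (T - 1) * K * K for T in lengths)
    # Node marginals plus one K x K pairwise-feature gradient per example.
    mixed = sum(T * K + K * K for T in lengths)
    return marginals / naive, mixed / naive


# Hypothetical sizes: two sequences, 5 labels, 10000 features.
ratio_marginals, ratio_mixed = memory_ratios([10, 20], 5, 10000)
```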
7 Discussion 


Due to its memory requirements, it may be difficult to apply the SAG algorithm for natural language 
applications involving complex features that depend on a large number of labels. However, grouping training 
examples into mini-batches can also reduce the memory requirement (since only the gradients with respect to 
the mini-batches would be needed). An alternative strategy for reducing the memory is to use the algorithm 
of Johnson and Zhang [2013] or Zhang et al. [2013]. These require evaluating the chosen training example 
twice on each iteration, and occasionally require full passes through the data, but do not have the memory 
requirements of SAG (in our experiments, these performed similarly to or slightly worse than running SAG at 
half speed). 
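
The trade-off described above can be sketched with a minimal loop in the style of Johnson and Zhang [2013]. The sketch uses scalar quadratics $f_i(x) = \frac{a_i}{2}(x - b_i)^2$ so that it stays self-contained. The function name `svrg_sketch`, the step-size, and the loop lengths are illustrative assumptions, not settings from our experiments.

```python
import random


def svrg_sketch(a, b, step, epochs, inner, seed=0):
    """Variance-reduced updates for f_i(x) = (a_i / 2) * (x - b_i)^2."""
    rng = random.Random(seed)
    n = len(a)
    grad = lambda i, x: a[i] * (x - b[i])
    x = 0.0
    for _ in range(epochs):
        snapshot = x
        # Occasional full pass through the data.
        full = sum(grad(i, snapshot) for i in range(n)) / n
        for _ in range(inner):
            i = rng.randrange(n)
            # Example i is evaluated twice, but no per-example memory is kept.
            x -= step * (grad(i, x) - grad(i, snapshot) + full)
    return x


a = [1.0, 2.0, 4.0]
b = [3.0, -1.0, 0.5]
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)  # closed-form minimizer
x_hat = svrg_sketch(a, b, step=0.05, epochs=30, inner=30)
```

The only state carried between iterations is the snapshot and its full gradient, which is what removes the per-example storage at the cost of the second gradient evaluation.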

We believe linearly-convergent stochastic gradient algorithms with non-uniform sampling could give a 
substantial performance improvement in a large variety of CRF training problems, and we emphasize that 
the method likely has extensions beyond what we have examined. For example, we have focused on the 
case of $\ell_2$-regularization, but for large-scale problems there is substantial interest in using 
$\ell_1$-regularization for CRFs [Tsuruoka et al. 2009, Lavergne et al. 2010, Zhou et al. 2011]. Fortunately, 
such non-smooth regularizers can be handled with a proximal-gradient variant of the method, see Defazio 
et al. [2014]. While we have considered chain-structured data, the algorithm applies to general graph 
structures, and any method for computing/approximating the marginals could be adopted. Finally, the SAG 
algorithm could be modified to use multi-threaded computation as in the algorithm of Lavergne et al. [2010], 
and indeed might be well-suited to massively distributed parallel implementations. 
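
As an illustration of how a non-smooth $\ell_1$ regularizer enters a proximal-gradient style update, the sketch below applies soft-thresholding after a step along a given direction. It is a generic proximal step and not the specific variant analyzed by Defazio et al. [2014]. The names `soft_threshold` and `prox_step`, and the parameters `step` and `lam`, are assumptions made for this example.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def prox_step(w, direction, step, lam):
    """Move along -direction, then apply the l1 proximal operator."""
    return [soft_threshold(wi - step * di, step * lam)
            for wi, di in zip(w, direction)]


# Hypothetical usage with a zero direction, so only the shrinkage is visible.
w_new = prox_step([1.0, -0.2, 0.05], [0.0, 0.0, 0.0], step=0.1, lam=1.0)
```

Coordinates whose magnitude falls below `step * lam` are set exactly to zero, which is the source of the sparsity that motivates $\ell_1$-regularization.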
Appendix A: Proof of Part (a) of Proposition 1 

In this section we consider the minimization problem 

\[ \min_x f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x), \]

where each $f_i'$ is $L$-Lipschitz continuous and each $f_i$ is $\mu$-strongly-convex. We will define Algorithm 2, a 
variant of SAGA, by the sequences $\{x^k\}$, $\{\nu_k\}$, and $\{\phi_j^k\}$ given by 

\[ \nu_k = \frac{1}{n p_{j_k}} \left[ f_{j_k}'(x^k) - f_{j_k}'(\phi_{j_k}^k) \right] + \frac{1}{n} \sum_{i=1}^n f_i'(\phi_i^k), \]
\[ x^{k+1} = x^k - \frac{1}{\eta} \nu_k, \]
\[ \phi_j^{k+1} = \begin{cases} x^k & \text{if } j = j_k, \\ \phi_j^k & \text{otherwise,} \end{cases} \]

where $j_k = j$ with probability $p_j$. In this section we'll use the convention that $x = x^k$, that $\phi_j = \phi_j^k$, and 
that $x^*$ is the minimizer of $f$. We first show that $\nu_k$ is an unbiased gradient estimator and derive a bound 
on its variance. 
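
A minimal sketch of this scheme, read as SAGA with non-uniform sampling, is given below for scalar quadratics $f_i(x) = \frac{a_i}{2}(x - b_i)^2$. The update form (importance weight $1/(n p_j)$ on the gradient difference, step $1/\eta$, memory set to the current iterate) is our reading of the definition. The function name, the test problem, and the choice $p_j \propto a_j$ are assumptions for illustration.

```python
import random


def saga_nus_sketch(a, b, p, eta, iters, seed=0):
    """Non-uniform SAGA for f_i(x) = (a_i / 2) * (x - b_i)^2, scalar x."""
    rng = random.Random(seed)
    n = len(a)
    grad = lambda i, x: a[i] * (x - b[i])
    x = 0.0
    phi = [x] * n  # memory points phi_j
    mem_avg = sum(grad(i, phi[i]) for i in range(n)) / n
    for _ in range(iters):
        j = rng.choices(range(n), weights=p)[0]  # j_k = j with probability p_j
        g_new, g_old = grad(j, x), grad(j, phi[j])
        # Importance-weighted difference keeps the estimate unbiased.
        nu = (g_new - g_old) / (n * p[j]) + mem_avg
        mem_avg += (g_new - g_old) / n  # keep the average of stored gradients
        phi[j] = x                      # phi_j <- x^k
        x -= nu / eta                   # x^{k+1} = x^k - nu_k / eta
    return x


a = [1.0, 2.0, 4.0]
b = [3.0, -1.0, 0.5]
p = [ai / sum(a) for ai in a]           # assumed: sample proportionally to a_i
L, mu, n = max(a), min(a), len(a)
eta = (4.0 * L + n * mu) / (n * min(p))
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)
x_hat = saga_nus_sketch(a, b, p, eta, iters=5000)
```

The running average `mem_avg` plays the role of $\frac{1}{n}\sum_i f_i'(\phi_i^k)$, so each iteration costs one new gradient evaluation plus a lookup of the stored one.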
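Lemma 1. We have $E[\nu_k] = f'(x^k)$ and subsequently 

\[ E\|\nu_k\|^2 \le 2E\left\| \frac{1}{n p_{j_k}} \left[ f_{j_k}'(x) - f_{j_k}'(x^*) \right] \right\|^2 + 2E\left\| \frac{1}{n p_{j_k}} \left[ f_{j_k}'(\phi_{j_k}) - f_{j_k}'(x^*) \right] \right\|^2. \]

Proof. We have 

\[ E[\nu_k] = \sum_{j=1}^n p_j \frac{1}{n p_j} \left[ f_j'(x) - f_j'(\phi_j) \right] + \frac{1}{n} \sum_{i=1}^n f_i'(\phi_i) = \frac{1}{n} \sum_{i=1}^n f_i'(x) = f'(x). \]

To show the second part, we use that $E\|X - E[X] + Y\|^2 = E\|X - E[X]\|^2 + E\|Y\|^2$ if $X$ and $Y$ are 
independent, $E\|X - E[X]\|^2 \le E\|X\|^2$, and $\|x + y\|^2 \le 2\|x\|^2 + 2\|y\|^2$. Writing 
$A = \frac{1}{n p_{j_k}}[f_{j_k}'(x) - f_{j_k}'(x^*)]$ and $B = \frac{1}{n p_{j_k}}[f_{j_k}'(\phi_{j_k}) - f_{j_k}'(x^*)]$, we have 
$E[A] = f'(x)$ and $E[B] = \frac{1}{n}\sum_i f_i'(\phi_i)$ because $f'(x^*) = 0$, and $\nu_k = A - B + E[B]$. Hence 

\[ E\|\nu_k\|^2 = E\left\| \frac{1}{n p_{j_k}} \left[ f_{j_k}'(x) - f_{j_k}'(\phi_{j_k}) \right] + \frac{1}{n} \sum_{i=1}^n f_i'(\phi_i) \right\|^2 \]
\[ = E\|(A - E[A]) - (B - E[B])\|^2 + \|f'(x)\|^2 \]
\[ \le 2E\|A - E[A]\|^2 + 2E\|B - E[B]\|^2 + \|f'(x)\|^2 \]
\[ = 2E\|A\|^2 - 2\|f'(x)\|^2 + 2E\|B - E[B]\|^2 + \|f'(x)\|^2 \]
\[ \le 2E\|A\|^2 + 2E\|B\|^2. \]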
We will also make use of the inequality 

\[ \langle f'(x), x^* - x \rangle \le -\frac{\mu}{2} \|x - x^*\|^2 - \frac{1}{2Ln} \sum_i \|f_i'(x) - f_i'(x^*)\|^2, \qquad (6) \]

which follows from Defazio et al. [2014, Lemma 1] using that $f'(x^*) = 0$ and the non-positivity of 
$\frac{L - \mu}{L}[f(x^*) - f(x)]$. We now give the proof of part (a) of Proposition 1, which we state below. 
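Proposition 1 (a). If $\eta = \frac{4L + n\mu}{n p_m}$ and $p_m = \min_j \{p_j\}$, then Algorithm 2 has 

\[ E[\|x^k - x^*\|^2] \le \left( 1 - \frac{n p_m \mu}{4L + n\mu} \right)^k \left[ \|x^0 - x^*\|^2 + \dots \right]. \]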
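Proof. We denote the Lyapunov function at iteration $k$ by 

\[ T^k = \frac{1}{n} \sum_i \frac{1}{n p_i} \|f_i'(\phi_i^k) - f_i'(x^*)\|^2 + c\|x^k - x^*\|^2. \]

We will show that $E[T^{k+1}] \le (1 - \frac{1}{\kappa}) T^k$ for some $\kappa > 1$. First, we write the expectation of the first term 
as 

\[ E\left[ \frac{1}{n} \sum_i \frac{1}{n p_i} \|f_i'(\phi_i^{k+1}) - f_i'(x^*)\|^2 \right] = \frac{1}{n^2} \sum_i \|f_i'(x) - f_i'(x^*)\|^2 + \frac{1}{n} \sum_i \frac{1 - p_i}{n p_i} \|f_i'(\phi_i) - f_i'(x^*)\|^2. \qquad (7) \]

Next, we simplify the other term of $E[T^{k+1}]$, 

\[ cE\|x^{k+1} - x^*\|^2 = cE\left\| x - x^* - \frac{1}{\eta} \nu_k \right\|^2 = c\|x - x^*\|^2 + \frac{c}{\eta^2} E\|\nu_k\|^2 - \frac{2c}{\eta} \langle f'(x), x - x^* \rangle. \]

We now use Lemma 1 followed by Inequality (6), 

\[ cE\|x^{k+1} - x^*\|^2 \le \left( c - \frac{c\mu}{\eta} \right) \|x - x^*\|^2 + \sum_i \left( \frac{2c}{n^2 \eta^2 p_i} - \frac{c}{n \eta L} \right) \|f_i'(x) - f_i'(x^*)\|^2 + \sum_i \frac{2c}{n^2 \eta^2 p_i} \|f_i'(\phi_i) - f_i'(x^*)\|^2. \]

We use this to bound the expected improvement in the Lyapunov function, 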
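\[ E[T^{k+1}] - T^k = E[T^{k+1}] - \frac{1}{n} \sum_i \frac{1}{n p_i} \|f_i'(\phi_i) - f_i'(x^*)\|^2 - c\|x - x^*\|^2 \]
\[ \le \frac{1}{n^2} \sum_i \|f_i'(x) - f_i'(x^*)\|^2 + \frac{1}{n} \sum_i \frac{1 - p_i}{n p_i} \|f_i'(\phi_i) - f_i'(x^*)\|^2 \qquad \text{(from (7))} \]
\[ \quad + \left( c - \frac{c\mu}{\eta} \right) \|x - x^*\|^2 + \sum_i \left( \frac{2c}{n^2 \eta^2 p_i} - \frac{c}{n \eta L} \right) \|f_i'(x) - f_i'(x^*)\|^2 + \sum_i \frac{2c}{n^2 \eta^2 p_i} \|f_i'(\phi_i) - f_i'(x^*)\|^2 \qquad \text{(from above)} \]
\[ \quad - \frac{1}{n} \sum_i \frac{1}{n p_i} \|f_i'(\phi_i) - f_i'(x^*)\|^2 - c\|x - x^*\|^2 \qquad \text{(definition of } T^k \text{)} \]
\[ = \frac{1}{n^2} \sum_i \|f_i'(x) - f_i'(x^*)\|^2 - \frac{1}{n^2} \sum_i \|f_i'(\phi_i) - f_i'(x^*)\|^2 - \frac{c\mu}{\eta} \|x - x^*\|^2 \]
\[ \quad + \sum_i \left( \frac{2c}{n^2 \eta^2 p_i} - \frac{c}{n \eta L} \right) \|f_i'(x) - f_i'(x^*)\|^2 + \sum_i \frac{2c}{n^2 \eta^2 p_i} \|f_i'(\phi_i) - f_i'(x^*)\|^2 \]
\[ = -\frac{1}{\kappa} T^k + \left( \frac{1}{\kappa} - \frac{\mu}{\eta} \right) c\|x - x^*\|^2 + \sum_i \left( \frac{2c}{n^2 \eta^2 p_i} + \frac{1}{n^2} - \frac{c}{n \eta L} \right) \|f_i'(x) - f_i'(x^*)\|^2 \]
\[ \quad + \sum_i \left( \frac{2c}{n^2 \eta^2 p_i} - \frac{1}{n^2} + \frac{1}{n^2 \kappa p_i} \right) \|f_i'(\phi_i) - f_i'(x^*)\|^2 \qquad (*) \]
\[ \le -\frac{1}{\kappa} T^k + \left( \frac{1}{\kappa} - \frac{\mu}{\eta} \right) \left[ c\|x - x^*\|^2 \right] + \left( \frac{2c}{n^2 \eta^2 p_m} + \frac{1}{n^2} - \frac{c}{n \eta L} \right) \left[ \sum_i \|f_i'(x) - f_i'(x^*)\|^2 \right] \]
\[ \quad + \left( \frac{2c}{n^2 \eta^2 p_m} - \frac{1}{n^2} + \frac{1}{n^2 \kappa p_m} \right) \left[ \sum_i \|f_i'(\phi_i) - f_i'(x^*)\|^2 \right], \]

where in (*) we add and subtract $\frac{1}{\kappa} T^k$, and in the last line we assumed $c > 0$ and used $p_i \ge p_m$. The terms 
in square brackets are positive, and if we can choose the constants $\{c, \kappa, \eta\}$ to make the round brackets 
non-positive, we have 

\[ E[T^{k+1}] \le \left( 1 - \frac{1}{\kappa} \right) T^k. \]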

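For the first expression, choosing $\kappa = \frac{\eta}{\mu}$ makes it zero. We can make the third expression zero under this 
choice of $\kappa$ by choosing $c = \frac{\eta}{2}(\eta p_m - \mu)$. This follows because 

\[ \frac{2c}{n^2 \eta^2 p_m} - \frac{1}{n^2} + \frac{\mu}{n^2 \eta p_m} = \frac{1}{n^2} \left( 1 - \frac{\mu}{\eta p_m} \right) - \frac{1}{n^2} + \frac{\mu}{n^2 \eta p_m} = 0. \]

For the second expression, note that with our choice of $c$ we have 

\[ \frac{2c}{n^2 \eta^2 p_m} + \frac{1}{n^2} - \frac{c}{n \eta L} = \frac{2}{n^2} - \frac{\mu}{n^2 \eta p_m} - \frac{\eta p_m - \mu}{2nL}, \]

which (multiplying by $n$) is negative if we have 

\[ \frac{2}{n} + \frac{\mu}{2L} - \frac{\mu}{n \eta p_m} \le \frac{\eta p_m}{2L}. \]

Ignoring the last term on the left-hand side (which only helps), we can choose 

\[ \eta = \frac{4L + n\mu}{n p_m}. \]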
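
The choice of constants can be checked numerically. The sketch below evaluates the three round-bracket coefficients, taking $\kappa = \eta/\mu$ and $c = \frac{\eta}{2}(\eta p_m - \mu)$, which are the values that make the first and third coefficients vanish. These closed forms are our reading of the argument, and the function name and test values are assumptions for illustration.

```python
def lyapunov_coefficients(L, mu, n, p_min):
    """Return (eta, kappa, c, first, second, third) for the three brackets."""
    eta = (4.0 * L + n * mu) / (n * p_min)
    kappa = eta / mu
    c = eta * (eta * p_min - mu) / 2.0
    # Coefficient of c * ||x - x*||^2.
    first = 1.0 / kappa - mu / eta
    # Coefficient of sum_i ||f_i'(x) - f_i'(x*)||^2.
    second = (2.0 * c / (n ** 2 * eta ** 2 * p_min)
              + 1.0 / n ** 2 - c / (n * eta * L))
    # Coefficient of sum_i ||f_i'(phi_i) - f_i'(x*)||^2.
    third = (2.0 * c / (n ** 2 * eta ** 2 * p_min)
             - 1.0 / n ** 2 + 1.0 / (n ** 2 * kappa * p_min))
    return eta, kappa, c, first, second, third


# Hypothetical problem constants.
eta, kappa, c, first, second, third = lyapunov_coefficients(
    L=4.0, mu=1.0, n=3, p_min=1.0 / 7.0)
```

For these values the first and third coefficients are zero up to rounding and the second is strictly negative, so all three round brackets are non-positive as required.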
We will also require that $c > 0$ to complete the proof, but this follows because $\eta > \frac{\mu}{p_m}$. By using that 
\[ cE[\|x^{k+1} - x^*\|^2] \le E[T^{k+1}] \le \left( 1 - \frac{1}{\kappa} \right) T^k \]
and chaining the expectations while using the definition of $\eta$ we obtain 

\[ E[\|x^k - x^*\|^2] \le \frac{1}{c} \left( 1 - \frac{n p_m \mu}{4L + n\mu} \right)^k T^0. \]

To get the final expression, use that 
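Appendix B: Proof of Part (b) of Proposition 1 

In this section we consider the minimization problem 

\[ \min_x f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x), \]

where each $f_i'$ is $L_i$-Lipschitz continuous and $f$ is $\mu$-strongly-convex. We write $\bar{L} = \frac{1}{n} \sum_{i=1}^n L_i$. We will define 
Algorithm 3 by the sequences $\{x^k\}$, $\{\nu_k\}$, and $\{\phi_j^k\}$ given by 

\[ \nu_k = \frac{\bar{L}}{L_{j_k}} \left[ f_{j_k}'(x^k) - f_{j_k}'(\phi_{j_k}^k) \right] + \frac{1}{n} \sum_{i=1}^n f_i'(\phi_i^k), \]
\[ x^{k+1} = x^k - \gamma \nu_k, \]
\[ \phi_j^{k+1} = \begin{cases} x^k & \text{if } j = r_k, \\ \phi_j^k & \text{otherwise,} \end{cases} \]

where $j_k = j$ with probability $\frac{L_j}{\sum_{i=1}^n L_i}$ and $r_k$ is picked uniformly at random. This is identical to Algorithm 2, 
except it uses a specific choice of the $p_j$ and the memory $\phi_j$ is updated based on a different random sample 
that is sampled uniformly. This algorithm maintains the key property that the expected step is a gradient 
step, $E[\nu_k] = f'(x^k)$. 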
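
The decoupling of the two random samples can be sketched as follows, again for scalar quadratics $f_i(x) = \frac{a_i}{2}(x - b_i)^2$, for which $L_i = a_i$. The update form (weight $\bar{L}/L_j$ on the gradient difference, with $\bar{L}$ the mean of the $L_i$, and a uniformly chosen memory slot) is our reading of the definition. The function name, the step-size $\gamma = 1/(4\bar{L})$ as a default, and the test problem are assumptions for illustration.

```python
import random


def saga_lipschitz_sketch(a, b, iters, seed=0):
    """Lipschitz sampling for the step, uniform sampling for the memory."""
    rng = random.Random(seed)
    n = len(a)
    L = list(a)                      # L_i = a_i for these quadratics
    L_bar = sum(L) / n
    gamma = 1.0 / (4.0 * L_bar)      # assumed step-size
    grad = lambda i, x: a[i] * (x - b[i])
    x = 0.0
    phi = [x] * n
    mem_avg = sum(grad(i, phi[i]) for i in range(n)) / n
    for _ in range(iters):
        j = rng.choices(range(n), weights=L)[0]  # P(j_k = j) = L_j / sum_i L_i
        r = rng.randrange(n)                     # r_k uniform
        nu = (L_bar / L[j]) * (grad(j, x) - grad(j, phi[j])) + mem_avg
        # The memory update uses the uniform sample r, evaluated at x^k.
        mem_avg += (grad(r, x) - grad(r, phi[r])) / n
        phi[r] = x
        x -= gamma * nu
    return x


a = [1.0, 2.0, 4.0]
b = [3.0, -1.0, 0.5]
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)
x_hat = saga_lipschitz_sketch(a, b, iters=2000)
```

Each iteration evaluates gradients for two examples (`j` for the step and `r` for the memory), which is the price paid for keeping the memory update uniform.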


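From our assumptions about $f$ and the $f_i$, we have [Nesterov 2004, see Chapter 2] 

\[ f_i(x) \ge f_i(y) + \langle f_i'(y), x - y \rangle + \frac{1}{2L_i} \|f_i'(x) - f_i'(y)\|^2 \qquad (8) \]

and 

\[ f(x) \ge f(y) + \langle f'(y), x - y \rangle + \frac{\mu}{2} \|x - y\|^2. \qquad (9) \]

We use these to derive several useful inequalities that we will use in the analysis. Adding the former times 
$\frac{1}{2n}$ for all $i$ to the latter times $\frac{1}{2}$, both applied with $x^*$ in place of $x$ and $x$ in place of $y$, gives the inequality 

\[ \langle f'(x), x^* - x \rangle \le f(x^*) - f(x) - \frac{\mu}{4} \|x^* - x\|^2 - \frac{1}{4n} \sum_i \frac{1}{L_i} \|f_i'(x^*) - f_i'(x)\|^2. \qquad (10) \]

Also by applying (8) with $y = x^*$ and $x = \phi_i$, for each $f_i$ and summing, we have that for all $\phi_i$ and $x^*$: 

\[ \frac{1}{n} \sum_i \frac{1}{L_i} \|f_i'(\phi_i) - f_i'(x^*)\|^2 \le \frac{2}{n} \sum_i \left[ f_i(\phi_i) - f(x^*) - \langle f_i'(x^*), \phi_i - x^* \rangle \right]. \qquad (11) \]

Further, by minimizing both sides of (9) we obtain 

\[ -\|f'(x)\|^2 \le -2\mu [f(x) - f(x^*)]. \qquad (12) \]

We next derive a bound on the variance of the gradient estimate. 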

Lemma 2. For any $\phi_i$, with $x^{k+1}$ and $x^k$ as given by Algorithm 2, we have 

II t'Afh"-.) - t'.(.7:’)l'^ 

n ^ Li 




1 


n •*—' L 

i 

Proof. We again follow the SAGA argument closely here 
We can expand those expectations as follows: 
and similarly for E 
We now give the proof of part (b) of Proposition 1, which we state below. 


Proposition 1 (b). If $\gamma = \frac{1}{4\bar{L}}$, then Algorithm 3 has 

\[ E[\|x^k - x^*\|^2] \le \left( 1 - \min\left\{ \frac{1}{3n}, \frac{\mu}{8\bar{L}} \right\} \right)^k \left[ \|x^0 - x^*\|^2 + \frac{n}{2\bar{L}} \left( f(x^0) - f(x^*) \right) \right]. \]

Proof. We define the Lyapunov function as 

\[ T^k = \frac{1}{n} \sum_i f_i(\phi_i^k) - f(x^*) - \frac{1}{n} \sum_i \langle f_i'(x^*), \phi_i^k - x^* \rangle + c\|x^k - x^*\|^2. \]

The expectations of the first terms in $T^{k+1}$ are straightforward to simplify: 

\[ E\left[ \frac{1}{n} \sum_i f_i(\phi_i^{k+1}) \right] = \frac{1}{n} f(x) + \left( 1 - \frac{1}{n} \right) \frac{1}{n} \sum_i f_i(\phi_i), \]
\[ E\left[ -\frac{1}{n} \sum_i \langle f_i'(x^*), \phi_i^{k+1} - x^* \rangle \right] = -\frac{1}{n} \langle f'(x^*), x - x^* \rangle - \left( 1 - \frac{1}{n} \right) \frac{1}{n} \sum_i \langle f_i'(x^*), \phi_i - x^* \rangle. \]

Note that these terms make use of the uniformly sampled $\phi_{r_k}^{k+1} = x^k$ value. For the change in the last term 
of $T^{k+1}$ we expand the quadratic and apply $E[x^{k+1}] = x^k - \gamma f'(x^k)$ to simplify the inner product term: 

\[ cE\|x^{k+1} - x^*\|^2 = cE\|x - x^* + x^{k+1} - x\|^2 = c\|x - x^*\|^2 - 2c\gamma \langle f'(x), x - x^* \rangle + cE\|x^{k+1} - x\|^2. \]

We now apply Lemma 2 to bound the error term $cE\|x^{k+1} - x^k\|^2$, giving: 

\[ \dots \]

Now we bound $-2c\gamma \langle f'(x), x - x^* \rangle$ with (10) and then apply (11) to bound $E\|f_j'(\phi_j) - f_j'(x^*)\|^2$: 

\[ cE\|x^{k+1} - x^*\|^2 \le \dots \]

We can now combine the bounds we have derived for each term in $T$, and pull out a fraction $\frac{1}{\kappa}$ of $T^k$ (for 
any $\kappa$ at this point). Together with (12) this yields: 

jgjyfc+ij _j,k ]_j,k + Q - 2c7 - 2c7V^ [/(*'') - fix*)] 

+ (^+4c7"T- 

^ _ i i 

+ ( 2^1 - 1) C7i ^ ^ \\fUx'^) - f:ix*)f . 
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K 2 
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(13) 


Note that the term in square brackets in the second row is positive in light of (11). We now attempt to 
find constants that satisfy the required relations. We start with naming the constants that we need to be 
non-positive: 

\[ \dots \]

Recall that we are using the step size $\gamma = \frac{1}{4\bar{L}}$, and thus $c_4 = 0$. Setting $c_1$ to zero gives 

\[ c = \dots \]

which is positive since $\gamma\mu < 1$. Now we look at the restriction that $c_2 \le 0$ places on $\kappa$: 

\[ \dots \]

We also have the restriction from $c_3 = \frac{1}{\kappa} - \frac{1}{2}\gamma\mu$ of 

\[ \frac{1}{\kappa} \le \frac{1}{2}\gamma\mu = \frac{\mu}{8\bar{L}}, \]

therefore we can take 

\[ \frac{1}{\kappa} = \min\left\{ \frac{1}{3n}, \frac{\mu}{8\bar{L}} \right\}. \]
Figure 3: Test error against effective number of passes for different deterministic, stochastic, and 
semi-stochastic optimization strategies (this figure is best viewed in colour). Top-left: OCR, Top-right: 
CoNLL-2000, bottom-left: CoNLL-2002, bottom-right: POS-WSJ. 


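Note that $c\|x^k - x^*\|^2 \le T^k$, and therefore by chaining expectations and plugging in constants we get: 

\[ E[\|x^k - x^*\|^2] \le \left( 1 - \min\left\{ \frac{1}{3n}, \frac{\mu}{8\bar{L}} \right\} \right)^k \left[ \|x^0 - x^*\|^2 + \frac{n}{2\bar{L}} \left( f(x^0) - f(x^*) \right) \right]. \]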
Appendix C: Test Error Plots for All Methods 

In the main body we only plotted test error for a subset of the methods. In Figure 3 we plot the test error 
of all methods considered in Figure 1. Note that Pegasos does not appear on the plot (despite being in the 
legend) because its values exceed the maximum plotted values. In these plots we see that the SAG-NUS 
methods perform similarly to the best among the optimally-tuned stochastic gradient methods in terms of 
test error, despite the lack of tuning required to apply these methods. 


Appendix D: Improved Results for OEG 

Owing to the high variance of the performance of the OEG method, we explored whether better performance 
could be obtained with the OEG method. The two most salient observations from these experiments were 
that (i) utilizing a random permutation on the first pass through the data seems to be crucial to performance, 
and (ii) better performance could be obtained on the two datasets where OEG performed poorly by 
using a different initialization. In particular, better performance could be obtained by initializing the parts 
with the correct labels to a larger value, such as 10. In Figure 4 we plot the performance of the OEG 
method without using the random permutation (OEG-noRP) as well as OEG with this initialization (OEG-10). 
Removing the random permutation makes OEG perform much worse on one of the datasets, while 
using the different initialization makes OEG perform nearly as well as SAG-NUS* on the datasets where 
previously it performed poorly (although it does not make up the performance gap on the remaining data 
set). Performance did not further improve by using even larger values in the initialization, and using a value 
that was too large led to numerical problems. 

Figure 4: Objective minus optimal objective value against effective number of passes for different variants 
of OEG. Top-left: OCR, Top-right: CoNLL-2000, bottom-left: CoNLL-2002, bottom-right: POS-WSJ. 


Appendix E: Runtime Plots 

In the main body we plot the performance against the effective number of passes as an 
implementation-independent way of comparing the different algorithms. In all cases except SMD, we implemented a C 
version of the method and also compared the running times of our different implementations. This ties 
the results to the hardware used to perform the experiments and to our specific implementation, and thus 
says little about the runtime in different hardware settings or different implementations, but does show 
the practical performance of the methods in this particular setting. We plot the training objective against 
runtime in Figure 5 and the test error in Figure 6. In general, the runtime plots show the exact same trends 
as the plots against the effective number of passes. However, we note several small differences: 

• AdaGrad performs slightly worse in terms of runtime. This seems to be due to the extra square root 
operators needed to implement the method. 

• Hybrid performs worse in terms of runtime, although it was still faster than the L-BFGS method. 
This seems to be due to the higher relative cost of applying the L-BFGS update when the batch size 
is small. 

• OEG performed much worse in terms of runtime, even with the better initialization from the previous 
section. 
Figure 5: Objective minus optimal objective value against time for different deterministic, stochastic, and 
semi-stochastic optimization strategies. Top-left: OCR, Top-right: CoNLL-2000, bottom-left: CoNLL-2002, 
bottom-right: POS-WSJ. 




Figure 6: Test error against time for different deterministic, stochastic, and semi-stochastic optimization 
strategies. Top-left: OCR, Top-right: CoNLL-2000, bottom-left: CoNLL-2002, bottom-right: POS-WSJ. 
Finally, we note that these implementations are available on the first author’s webpage: 
http://www.cs.ubc.ca/~schmidtm/Software/SAG4CRF.html 


References 

F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine 
learning. Advances in Neural Information Processing Systems, 2011. 

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. 
SIAM Journal on Imaging Sciences, 2(1):183-202, 2009. 

T. Cohn and P. Blunsom. Semantic role labelling with tree conditional random fields. Conference on 
Computational Natural Language Learning, 2005. 

M. Collins. Discriminative training methods for hidden Markov models: theory and experiments with 
perceptron algorithms. Conference on Empirical Methods in Natural Language Processing, 2002. 

M. Collins, A. Globerson, T. Koo, X. Carreras, and P. Bartlett. Exponentiated gradient algorithms for 
conditional random fields and max-margin Markov networks. The Journal of Machine Learning Research, 
9:1775-1822, 2008. 

A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for 
non-strongly convex composite objectives. Advances in Neural Information Processing Systems, 2014. 

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic 
optimization. Journal of Machine Learning Research, 12:2121-2159, 2011. 

J. R. Finkel, A. Kleeman, and C. D. Manning. Efficient, feature-based, conditional random field parsing. 
Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2008. 

M. P. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal 
on Scientific Computing, 34(3):A1351-A1379, 2012. 

S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic 
composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469-1492, 
2012. 

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. 
Advances in Neural Information Processing Systems, 2013. 

S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for 
structural SVMs. International Conference on Machine Learning, 2013. 

J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting 
and labeling sequence data. International Conference on Machine Learning, 2001. 

T. Lavergne, O. Cappé, and F. Yvon. Practical very large scale CRFs. In Proceedings of the 48th Annual 
Meeting of the Association for Computational Linguistics (ACL), pages 504-513, 2010. 

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence 
rate for strongly-convex optimization with finite training sets. Advances in Neural Information Processing 
Systems, 2012. 

A. McCallum, K. Rohanimanesh, and C. Sutton. Dynamic conditional random fields for jointly labeling 
multiple sequences. In NIPS Workshop on Syntax, Semantics, Statistics, 2003. 





A. Nedic and D. Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic 
Optimization: Algorithms and Applications, pages 263-304. Kluwer Academic, 2000. 

D. Needell, N. Srebro, and R. Ward. Stochastic gradient descent, weighted sampling, and the randomized 
Kaczmarz algorithm. Advances in Neural Information Processing Systems, 2014. 

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic 
programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009. 

Y. Nesterov. Introductory lectures on convex optimization: A basic course. Springer, 2004. 

Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim., 
22(2):341-362, 2012. 

S. Nowozin and C. H. Lampert. Structured learning and prediction in computer vision. Foundation and 
Trends in Computer Vision, 6, 2011. 

F. Peng and A. McCallum. Information extraction from research papers using conditional random fields. 
Information Processing & Management, 42(4):963-979, 2006. 

B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on 
Control and Optimization, 30(4):838-855, 1992. 

A. Ratnaparkhi. A maximum entropy model for part-of-speech tagging. Conference on Empirical Methods 
in Natural Language Processing, 1996. 

M. Schmidt. minFunc: unconstrained multivariate differentiable optimization in Matlab, 2005. URL 
http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html. 

M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv 
preprint, 2013. 

B. Settles. Biomedical named entity recognition using conditional random fields and rich feature sets. In 
Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its 
Applications, 2004. 

F. Sha and F. Pereira. Shallow parsing with conditional random fields. Conference of the North American 
Chapter of the Association for Computational Linguistics: Human Language Technology, 2003. 

S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: primal estimated sub-gradient solver for 
SVM. Mathematical Programming, 127(1):3-30, 2011. 

T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal 
of Fourier Analysis and Applications, 15(2):262-278, 2009. 

B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. Advances in Neural Information 
Processing Systems, 2003. 

Y. Tsuruoka, J. Tsujii, and S. Ananiadou. Stochastic gradient descent training for L1-regularized log-linear 
models with cumulative penalty. Annual Meeting of the Association for Computational Linguistics, pages 
477-485, 2009. 

S. Vishwanathan, N. N. Schraudolph, M. W. Schmidt, and K. P. Murphy. Accelerated training of conditional 
random fields with stochastic gradient methods. International Conference on Machine Learning, 2006. 

H. Wallach. Efficient training of conditional random fields. Master’s thesis, University of Edinburgh, 2002. 

L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM 
Journal on Optimization, 24(2):2057-2075, 2014. 




W. Xu. Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv 
preprint, 2010. 

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full 
gradients. Advances in Neural Information Processing Systems, 2013. 

J. Zhou, X. Qiu, and X. Huang. A fast accurate two-stage training algorithm for L1-regularized CRFs with 
heuristic line search strategy. International Joint Conference on Natural Language Processing, 2011. 

J. Zhu and E. Xing. Conditional Topic Random Fields. In International Conference on Machine Learning, 
2010. 





